Introduction to Data Science in Python by Appsilon

Course introduction

Piotr Pasza Storożenko@Appsilon

Introduction

About me

Piotr Pasza Storożenko, Machine Learning Engineer

A bit:

  • ML guy
  • Physicist
  • Computer scientist
  • Mathematician

you can find more on my blog: pstorozenko.github.io.

Why this course?

Through the years I got a lot of knowledge about python, R, julia, data science, machine learning, deep learning, software development.

Now I can share it with you!

What will be interesting for whom during this course?

  • Computer scientists
    • Easy to use and efficient tools to work with data
    • Made by software developers
  • Mathematicians
    • Intuitive tools that support reproducible experiments
    • Ridiculously easy work with plots
  • Electrical and mechanical engineers
    • Great alternative to matlab
    • Easy to use
    • Simple plots and animations
  • Economics students
    • Great alternative for spreadsheets
    • Set of free, open source tools
    • Uncomparable greater control of data when compared to MS Office, Tableau, PowerBI etc.
  • Physicists, chemists, biologists
    • Substential relief from spreadsheets
    • Open-source software
    • Much easier to create, reproducible plots

Data Science

What’s Data Science?

Why Data Science?

Explanation of various terms

  • Artificial Inteligence, AI – an umbrella term connected to everything where the computer/system makes decisions based on a set of rules, on an algorithm.
  • Data Science (DS) – everything related to data, from collecting, through processing, up to displaying and using
  • Machine Learning, ML – everything related to creating/training models that are able to learn rules based on provided data
  • [Deep] Neural Networks, [D]NN – subset of ML methods based on special class of models, so called neural networks. Their architecture (design) somehow resembles connections between biological neurons, hence the name.

Who’s a Data Scientist?

Someone who simultaneously:

  1. Discusses with so called bussiness the topic of required solutions.
  2. Based on provided and collected data creates solutions using programming skills.
  3. Delivers the results to the bussiness in a clear and interesing way.

Bussiness talks about AI, experts preffer to say ML

Data Science Workflow

Data Science Tools

Course plan

  1. Introduction and numpy - working with numbers
  2. pandas - working with data frames
  3. matplotlib and plotly - plotting data
  4. scikit-learn - introduction to machine learning
  5. streamlit, quarto, fastapi - sharing your work

Course plan

Data Science Workflow x This course

Data Science Workflow x This course ++

Tools used in the course

Tools used in the course

  • Python 3.9/3.10 via Anaconda + many additional packages
  • Visual Studio Code aka VS Code

Anaconda

Anaconda is the standard when it comes to managing python environemnts in the data science/machine learning community. It allows to obtain a consistent environment among various systems.

Why is it that important?

Data scienctists often work on many projects at the same time. Each project might require a different environment, with specific versions of python and other libraries.

This might be also a relief when working on different projects during studying!

Anaconda - How to create an environment

After installing anaconda, you have to clone this course repo, and move yourself in terminal into course repo. Then run:

conda create -n appsilon-ds-course python=3.10 -y
conda activate appsilon-ds-course
pip install -r requirements.txt

If you received the following error message:

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'

you’re in the wrong directory.

In case of problems, check out the official conda tutorial.

VSCode - code editor for 2022

Why VS Code?

  • Great support for both python scripts and jupyter notebooks.
  • Automatically detects conda environments
  • Great support for working with remote machines through SSH (althogh we will not use this feature)
  • One tool to work with python, R, julia, javascript, typescript etc.
  • Above all – VS Code is free

What to do after installing VS Code?

Install extensions

Environment for work and studing is ready!

pip vs conda vs pipenv vs …

We need multiple environments on a single machine.

How to live, what to use?

NEVER PLAY WITH DATA SCIENCE ON YOUR DEFAULT SYSTEM’S PYTHON

pip + virtualenv

  1. A basic package manager included in python
  2. Works only for a single version of python
  3. Capable of installing python packages only
  4. Basic package versioning with pip freeze
  5. Pretty fast when it doesn’t have to build packages

conda

  1. A package manager provided by Anaconda
  2. Allows for creating different environments for different major (3.9/3.10) and minor (3.10.3/3.10.4) python versions
  3. Is able to install other software than python packages as well (e.g. R or CUDA drivers)
  4. Basic package versioning with conda list --export
  5. Super slow for bigger environments
  6. Packages installed with conda can be shared across environments – lower disk usage (just PyTorch is ~1.7GB)

pipenv

  1. Looks like pip + virtualenv plus different python versions
  2. Very big focus on environment reproducibility
  3. Super slow for bigger environments

How to live?

The most reliable setup for experimenting is to do:

conda create -n my-env python==3.10.4
conda activate my-env
pip install ...

If you need to install CUDA drivers then do it during environment creation conda create -n my-env python cudatoolkit.

After you install all packages, save the python version in your README file e.g.,

Project created with python 3.10.4.

and store installed packages with pip freeze > requirements.txt.

Remember that not every package version is available for every python version. For example Tensorflow 2.10 is supported only in python>=3.10.